p/ste/J QP'/TZ r?r 


NASA Contractor Report 177979 
ICASE REPORT NO. 85-39 

NASA-CR- 177979 
19860002389 


ICASE 


TOWARDS DEVELOPING ROBUST ALGORITHMS FOR SOLVING 
PARTIAL DIFFERENTIAL EQUATIONS ON MIMD MACHINES 


Joel H. Saltz 
Vi jay K. Naik 



NASA Contract No. NAS1-17070 vpy \ ]985 

September 1985 

LANGLEY RESEARCH CENTER 
LIBRARY, NASA 
HAMPTON, VIRGINIA 


INSTITUTE FOR COMPUTER APPLICATIONS IN SCIENCE AND ENGINEERING 
NASA Langley Research Center, Hampton, Virginia 23665 

Operated by the Universities Space Research Association 


NASA 

National Aeronautics and 
Space Administration 

Langley Research Center 

Hampton Virginia 23665 




1 Report No NASA CR- 17 79 79 2 Government Accession No 

ICASE Report No. 85-39 

3 Recipient's Catalog No 

4 Title and Subtitle 

TOWARDS DEVELOPING ROBUST ALGORITHMS FOR SOLVING 

5 Report Date 
September 1985 

PARTIAL DIFFERENTIAL EQUATIONS ON MIMD MACHINES 

6 Performing Organization Code 

j 

7 Author(s) 

Joel H. Saltz and Vijay K. Naik 

8 Performing Organization Report No 

85-39 

10 Work Unit No 

9 Performing Organization Name and Address 

Institute for Computer Applications in Science 
and Engineering 

Mail Stop 132C, NASA Langley Research Center 
Hampton, VA 23665 


11 Contract or Grant No 

NAS1-17070 

13 Type of Report and Period Covered 

r.nnt met- nr Bonn r t- 

12 Sponsoring Agency Name and Address 

National Aeronautics and Space Administration 
Washington, D.C. 20546 

14 Sponsoring Agency 6ode 

505-31-83-01 

15 Supplementary Notes Submitted to Parallel Computing 

Langley Technical Monitor: 

J. C. South Jr. 

Final Report 

16 Abstract 

Methods are proposed for efficient computation of numerical algorithms on a 
wide variety of MIMD machines. These techniques reorganize the data dependency 
patterns so that the processor utilization is improved. 

The model problem examined finds the time-accurate solution to a parabolic 
partial differential equation discretized in space and implicitly marched 
forward in time. The algorithms investigated are extensions of Jacobi and 
SOR. The extensions consist of iterating over a window of several timesteps, 
allowing efficient overlap of computation with communication. 


The methods suggested here increase the degree to which work, can be 
performed while data are communicated between processors. The effect of the 
window size and of domain partitioning on the system performance is examined 
both analytically and experimentally by implementing the algorithm on a 
simulated multiprocessor system. 


17 Key Words (Suggested by Author (s) ) 

MIMD, multiprocessor, shared memory, 
iterative methods, performance of 
parallel processors. 

1 8 Distribution Statement 

59 - Mathematical & Computer Sciences 
Unclassified - Unlimited 

19 Security Classif (of this report) 
Unclassified 

20 Security Classif (of this page) 
Unclassified 

21 No of Pages 

52 



22 Price 

A0 4 


N-305 


For sale by the National Technical Information Service, Springfield Virginia 22161 


















TOWARDS DEVELOPING ROBUST ALGORITHMS FOR SOLVING 


PARTIAL DIFFERENTIAL EQUATIONS (Hi MIMD MACHINES 

Joel H. Saltz 

Institute for Computer Applications in Science and Engineering 

Vi jay K. Naik 

Institute for Computer Applications in Science and Engineering 

Abstract 

Methods are proposed for efficient computation of numerical algorithms on 
a wide variety of MIMD machines. These techniques reorganize the data 
dependency patterns so that the processor utilization is improved. 

The model problem examined finds the time-accurate solution to a parabolic 
partial differential equation discretized in space and implicitly marched 

forward in time. The algorithms investigated are extensions of Jacobi and 
SOR. The extensions consist of iterating over a window of several timesteps, 
allowing efficient overlap of computation with communication. 

The methods suggested here increase the degree to which work can be 

performed while data are communicated between processors. The effect of the 

window size and of domain partitioning on the system performance is examined 

both analytically and experimentally by implementing the algorithm on a 
simulated multiprocessor system. 


Research was supported in part by the National Aeronautics and Space 
Administration under NASA Contract No. NAS1-17070 while the authors were in 
residence at ICASE, NASA Langley Research Center, Hampton, VA 23665. 

The first author was also supported in part by NASA Grant NAG1-466, 
Medical Scientist Training Program, NIH Grant HL-11307. 


i 




flU' //?•&> 



1 . Introduction 


It is widely recognized that parallel computation is essential to overcome 
the fundamental physical limits due to circuit switching and signal 
propagation delays on the computational speeds of sequential processing. 
Recent technological advances in the integration of electronic circuits has 
made it possible to conceive of large systems of processing elements working 
together in parallel on a single problem. In order to achieve high 
performance from such systems, the problem is divided into small tasks that 
can be solved in parallel. The flow of data between tasks must be matched to 
the characteristics of the interprocessor communication network so that tasks 
running on separate processors can cooperate in an efficient manner during the 
computation [H0CK.81] , [VOIG85] . 

In recent years most algorithms developed for parallel processing have 
been designed for implementation on vector computers or single instruction 
multiple data (SIMD) type systems, mainly because of the relatively wide 
availability of such machines [ORTE85] . Fewer algorithms have been developed 
specifically for implementation on multiple instruction multiple data (MIMD) 
type systems [VOIG85]. In such systems individual processors independently 
execute their own set of instructions and may operate asynchronously. In many 
MIMD machines that have been either built or proposed, each processor has a 
local memory to which it has relatively rapid access [HWAN85] . In this paper 
we discuss some algorithms and methodologies for efficient numerical 
computations on such systems. The processors may communicate by using a 
communication network or by passing messages through a shared memory. 

The design of any parallel algorithm involves partitioning of the problem 
into individual tasks which can be executed concurrently on the processors of 
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a parallel processing system. If the problem can be subdivided among the 
processors in such a way that each processor can proceed at its own rate in 
solving its portion of the problem, without requiring information from other 
processors, the coordination is simplified. In the cases where this is not 
so, algorithms and machines must be matched so that the necessary data are 
made available where they are needed when they are required. For numerous 
algorithms, the total time required to move data to the appropriate processing 
element is larger than or equal to the time required to perform the 
computation [GENT78] , [ORTE85], [ SAAD8 5 ] . 

Complications may arise when processors require data computed in other 
processors to continue performing useful work on a problem. A number of 

factors can conspire to create a situation m which data are not available 

where and when the data are needed. Two such factors are: 1) interprocessor 

communication delays may prevent a processor from receiving a piece of 

information available elsewhere, when required; 2) the computational abilities 
of the processors performing work on the problem and the portions of work 
assigned may not be matched. In this paper we concern ourselves with the 
first problem. The question of load balancing is addressed in [SALT85] . 

The degree and properties of interprocessor communication delays are 
dependent on the characteristics of the computer architecture on which the 
problem is run, the algorithm used to solve the problem concurrently, and the 
mapping of the problem onto the architecture. If the scheme employed for 
solving the problem in parallel has the property that results are needed for 
use in other processors soon after being computed, the scheme will be 
sensitive to delays in the transmission of data between the processors, and 
the utilization of the system as a whole is apt to be degraded. Here we 
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define the utilization of the system as the average amount of time spent by 
the processors in performing useful operations divided by the total time 
required to solve the problem. 

In this paper we will demonstrate that it is possible to reorganize 
computations in each processor so as to increase the amount of useful work 
that can be performed on the problem while each processor waits for data from 
other processors. This study will be based on a model problem of obtaining 
the time-accurate solution to a linear parabolic partial differential equation 
discretized in space and implicitly marched forward in time. The algorithms 
investigated are iterative methods that are extensions of block Jacobi and 
SOR. 

In the next section we discuss the general principles involved in parallel 
implementations of basic iterative methods such as Jacobi, Gauss-Seidel, and 
SOR methods. In Section 3 the model problem is introduced. After showing the 
difficulties due to the communication delays encountered in applying the 
parallel versions of basic iterative methods, we present two algorithms which 
are extensions of Block Jacobi and Block SOR and which allow efficient overlap 
of computation with communication. Schemes for implementing these algorithms 
on multi-processor systems are discussed in Section 4. In Section 5 we derive 
bounds on processor utilizations for such implementations. These algorithms 
are implemented on a simulated shared memory machine and the results are given 
in Section 6. Experimental results on convergence and overhead of the 
algorithms developed in Section 3 are presented in Section 7, and it is shown 
that for the cases considered, the overhead is moderate compared to the gain 
obtained in employing these algorithms. Finally, concluding remarks are 


presented in Section 8. 
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2. Parallel Iterative Methods 

In this section the parallel implementation of basic iterative methods 
will be outlined. For specificity, we consider the standard uniform five 
point difference approximation to Laplace's equation given by 


u xx + u yy ® 


in a square domain subject to Dirichlet boundary conditions. The difference 
equation obtained when the mesh is n by n is 




= r K- 


i+l,J 
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i-l.j 
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The subscripts 1 and j range from 1 to n; and u t n+ j , u 1 q, u n+ | and 
un are given constants. Notice that the equation for each mesh point 
involves data at that point as well as the point's north, south, east, and 
west neighbors. 

The Jacobi method for the solution of this system of difference equations 
may be written as 
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where the superscript denotes the iteration number. The Jacobi method often 
converges very slowly, but is considered to be a prototype parallel method 
[ORTE85], On multiprocessor computers, single mesh-points or groups of mesh 
points are assigned to each processor. All points are updated each iteration, 
and since only the values from the last iteration are utilized, each step of 
the iteration may proceed to completion without any need for communication of 
data values during the step. 
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The Gauss-Seidel and SOR methods require new values at each point to 
replace the old values as soon as the new values are computed. In sequential 
computers the points in the domain are generally processed one iteration at a 
time, point by point and line by line. Because of the need for the new values 
of the variables at the previously computed points, only one mesh point can be 
dealt with at a time; hence, the process is not suited to multiprocessor 
machines. To deal with this problem the mesh points are grouped into two 
subsets such that no points in a subset are coupled to other points in the 
same subset [ERIC72], [HAYE74], [LAMB75], [ADAM82], The most common way of 
obtaining these sets is to use the classical red/black ordering [Y0UN71]. For 
the model domain this is obtained by assigning points to the black subset 
when i+j is odd, and to the red subset when i+j is even. Each iteration 
can then proceed in two separate phases where each phase has the properties of 
a Jacobi sweep in that all mesh-points in a given phase may be adjusted 
independently. Other iterative methods such as the chaotic or asynchronous 
relaxation involve the updating of mesh points with variable values that need 
not come from the previous iteration [CHAZ69], [BAUD78], [DEMI82]. In the 
simplest form of chaotic relaxation, each processor uses whatever values are 
available to compute the next value of the iterate at a given point, without 
any restriction on the iteration number of the variable value chosen. 

It is possible to implement Jacobi, Gauss Seidel, and SOR iterative 
methods so that all new values corresponding to a group of variables are 
determined at once. Such schemes are referred to as block iterative methods. 
The simultaneous determination of the values of groups of variables involves 
the solution of subsystems of equations, generally by direct methods. Under 
specific conditions the block iterative methods are known to converge at a 
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rate faster than the corresponding point iterative methods [Y0UN71], [VARG62] . 
Various solutions are proposed to introduce parallelism in the block SOR 

methods, including the scheme of partitioning the domain into red and black 

blocks [ERIC72] , [LAMB75] , [ PART80] , [PART82] , [FABE81 ] . 

3. Multistep Iterative Methods 

In many algorithms for the solution of time dependent problems, it is 
necessary to solve a sequence of linear systems of equations involving the 
same matrix but different right hand sides. In these cases, the right hand 
sides of the consecutive systems of equations are dependent on the solutions 
of the equations earlier in the sequence. The implicit solution of the 

parabolic partial differential equation using the Crank-Nicholson method in 

time gives rise to one such algorithm. Typically when an iterative method is 
applied to a time dependent problem, iterations proceed successively over the 
systems of equations for each timestep [HAGE81]. Since the consecutive 
systems of equations depend on the solutions of the earlier equations, such 
implementations limit the extent of possible parallelism. In this section we 
present algorithms which exploit a new level of parallelism in solving time 
dependent problems. We achieve this by traveosing more than one timestep 
during the course of a single iteration. TbSSse schemes allow coarse grained 
multiprocessor implementations, which are minimal Ly dependent on machine 
characteristics and have favorable, well-defined convergence properties. 

In the following, the multistep parallel -iterative methods wiLl first be 
discussed in a relatively synchronous conte*S? so that the ^nature of the 
numerical algorithm and the most straightforward implementations can be 
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understood. This will be followed by discussions of ways in which the 
algorithms described can be implemented so as to allow for less synchronized 
implementations and high processor utilizations. 


3. 1 Block Jacobi Iteration 
We consider the system 


My = b 


where M is an n by n matrix and y and b are vectors of length n. We 
assume for the remainder of the section that the above system is partitioned 
in the form 
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where the M^'s are submatrices and y^ and b^ represent subvectors. 
The Block Jacobi method is defined as follows, 


M. 

i,i 


v+1 


= - l 




M . y V + b. 
i.J J 1 


where v represents iteration number. All of the updated values at the 

v+lst iteration are obtained using values from the vth iteration. The 

important special case when M, are 1 by 1 submatrices gives rise to the 

L > J 


point Jacobi method 
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3.2 Multistep Block Jacobi Algorithm 

Now consider the following set of systems of equations. 

My(t f ) = Py(t j __ 1 ) + b(t^) , r = 1 , • • • ,m (1) 

where M and P are n x n matrices, y(t ) and b(t ) are vectors of n 

r r 

variables and t^_ is the r 1 "' 1 timestep. Shortly we will show how these 
equations arise m the solution of time dependent partial differential 
equations as well as the integration of first-order ordinary differential 
equations . 

The solution of these equations with standard block or point iterative 
methods involves the consecutive solution of each of the m systems of linear 
equations corresponding to the m timesteps. After iterating over each 
timestep until satisfactory convergence is obtained, one moves on to iterate 
over the next timestep. On a sequential machine, this is, in fact, the usual 
way an iterative method is applied to a time dependent problem [HAGE81], In 
Figure 1 this algorithm is depicted schematically and a parallel version is 
outlined below in an algorithmic form. Here 'q' is the number of partitions 
into which the domain is subdivided. 
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Jacobi Method Iterated Until Convergence over Each Tinestep 


For r=l to number_of_timesteps 

t 

Pardo i = 1 to q 


{ 


1 


' MW 


Beginning with v = 0 increment v until lly v+ ^(t ) - y v (t ) II < e 

{ r r 
Pardo i = 1 to q 
{ 

Solve: 

M y V+ ' - (t ) = - V M. y v (t ) + y P. y . (t , ) + b. (t ) 
i,i J i r ^ x,j } 3 r i,J *3 r-1 i v r' 

} 

Pardo i = 1 to q 

^ \ v+1. \ 

y.(t ) = y (t ) 

J i r 3 i r 


In the above algorithm as well as in the subsequent algorithms depicted 
below, we assume that the matrix P is decomposed into blocks using a 
partitioning identical to that already used for M. The subraatrix P^j hence 
includes the same rows and columns of P as M x does of M. Each vector 
bCt^) is decomposed into subvectors b^(t r ). 

Instead of iterating until convergence over each timestep before moving 
on to the next, it is possible to iterate over a number of timesteps at 
once. The set of timesteps over which these extended iterations occur is 
called a window. After a fixed number of iterations have taken place over the 
window, or after convergence has been obtained at the first timestep, the 
window moves up one timestep. When the window shifts upwards, the value of 
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the solution at the timestep that was at the top of the old window is treated 
as the initial approximation at the new timestep which is now at the top of 
the shifted window. At the beginning of the iterations, the initial value 
specified for the problem is used as the initial guess at all timesteps in the 
initial window. This multistep generalization of Block. Jacobi iteration is 
called Windowed Block Jacobi algorithm (WBJ). This is schematically depicted 
m Figure 2. 

In WBJ an iteration takes place when possible over a window of w 
timesteps. The term microstep describes the relative position of a timestep 
with respect to the beginning of a window. A timestep that is i timesteps 
above the beginning of a window will be designated as the ith microstep of the 
window. The number of microsteps advanced from the beginning of iterations 
performed over a window is designated as the number of cumulative microsteps 
(cms). In Figure 3 the process of measuring cumulative micros teps is 
illustrated by means of an example. The concept of cumulative microstep is an 
essential tool for the development and evaluation of techniques described in 
this paper. Note in the following algorithm that iterations are never 
performed on the timesteps beyond 'number of_timesteps' specified for the 
problem. Consequently, the window size during the solution of the last few 
timesteps may correspondingly be reduced. The following is an explicit 
outline of WBJ. As before q is the number of partitions, and w is the 
maximum number of timesteps in a window. 'Window' is the actual number of 
timesteps used in a window. 
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WBJ - Windowed Block Jacobi Method 


Pardo 1 = 1 to q 


{ 


Pardo s=l to window 

{ 

y i (t s ) = y i (t o ) 


} 


} 


For r=l to number_of_timesteps 

{ 


Beginning with v=0 increnent v until lly v+1 (t ) - y v (t ) II < e 
{ r r 
Pardo i = 1 to q 


{ 


window = min(w, number of timesteps - r + 1) 
for s = r to r + window - 1 
{ 

Solve : 

M. . yT +1 (t ) = - y M . y V (t ) + 
i.i- i s L 1,3 } 2 s 




. yT ^ ( c ,) + 7 P y V (t . ) + b (t ) 

i,i J 1 s-1 l i,j s-r i s 7 




} 

Pardo l = 1 to q 


{ 


1, . v+1, 

y(t ) = y (t ) 

i r+window y i r+window-1 

, v+1, . 

y.(t)=y. (t) 

J l r J i r 


In both the algorithms discussed above, during iteration v+1 each partition 
i utilizes variables yj, j*l from other partitions. This procedure allows 
simultaneous computation of all partitions on a multiprocessor system. It 
should be noted that there may be several blocks assigned to each processor. 
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The pardo statements above simply serve to denote the existence of a number of 
subtasks that can run independently. 

In [SALT85] it Is shown that the spectral radius of the iteration matrix 
of WBJ for any window w is identical to that of the corresponding block 
Jacobi method; hence, the asymptotic convergence characteristics of the 
algorithms are comparable. In the Jacobi method, it is necessary to calculate 

q 

I P . y (t , ) only once for each timestep t over which the solution is 
i.J J s-l K s 

integrated. In WBJ this matrix vector multiplication must be performed only 
once when a timestep is at the bottom of a window, but it must be performed 
each time a timestep is above the lowest part of the window. Therefore one 
expects that the overhead involved in the new methods will be moderate when 
large blocks are utilized but quite high when the matrix is decomposed into 
very small blocks or points. In the next section, however, it will be shown 
that a minor reorganization of the computations can considerably reduce the 


overhead. 


Note that instead of utilizing solution values at earlier timesteps as 
initial approximations at the top of a new window, one may utilize explicit 
integration methods. In numerical experiments this was shown to aid 
convergence but in some cases to adversely affect stability. 

We now explicitly show how WBJ may be used to solve the set of ordinary 
differential equations obtained from the spatial discretization of a parabolic 
partial differential equation. Note, of course, that this method would apply 
to a system of differential equations obtained in other ways as well. This 
system may be represented as 


y = Qy + c(t) 
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where y, y and c(t) are vector valued functions and Q is a matrix of 
constants. Assume that y(t^) is given as an initial condition. 

Applying the Crank Nicholson integration method to discretize in time we 
obtain. 


(y(t r ) ~ = (Q/2)(y(t r ) + y(t r _ 1 >) + c(t r )/2 + cCt^p/2. 


To advance from Cq to t m we solve Eqns. (1) above with the following 
definitions of M, P, and b(t r ): 


M = (I/At - Q/2), P = (I/At + Q/2) 


and 


b(t r ) = c(t r )/2 + c(t r _ 1 )/2, 


r — 1, • • • ,m. 


Let Qi,j represent the partition of Q that corresponds to the same 
rows and columns as does the partition M,- * of M and P< j of P. Recall 
from the earlier discussion that the equations for y^ + * at each timestep 
t g within the window are given by, 


M 


i,i J t 


v+1 , . 

y« (O = 


- I M ± j 

j*i 


y V (t ) 
J s' 


► l ? 


i,j 


^ (t s-i> + p i,i *r 1(t 


S-P + b i (t 8 ) - 

(2) 


This may be rewritten in the case of this specific problem as follows: 
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( i/At ■ Q i,i / 2 ) y r l(t s ) = ( i/&t + Q 1 ,i / 2 ) y r 1 (t s-i ) 


+ J x ( (Q i,J / 2 )(y j (t s ) + y j (t s-l } )) + b i (t s ) ' 


(3) 


To realize the advantages of this rearrangement, consider the case that 

would appear to be the most problematical when (2) is applied, that is, when 

the block size is one, which corresponds to a point iterative method. 

Consider the computational overhead arising from iterations over timesteps 

t g that are above the lowest point in a window. By applying (2) an extra 

matrix vector multiplication appears to be required per iteration over each of 

these timesteps. In applying (3), y v (t ) + y v (t _. ) needs to be computed 

J 13 J s t 

only once per iteration. Hence, assuming that (I/At + ^ /2) has been 

precomputed, only one additional multiplication and two additions need be 

performed per variable for each iteration over t y . When a number of 

processors require the value y v (t ) + y v (t ,), one would expect it to be 

J s j s 1 

computed separately in each processor. 


3.3 Windowed-Block Successive Overrelaxation Algorithm 

One may refine WBJ with the aim of improving convergence by employing a 
multistep version of Block Successive Overrelaxation (BSOR). Partitioning is 
performed as described in the previous sections and, as before, each partition 
is advanced w timesteps during each iteration of the algorithm. In this case, 
however, the blocks are solved consecutively and estimated values of variables 
calculated from other partitions are used as soon as they become available. 
The other refinement involves the use of overrelaxation. A weighted average 
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of the old and the new variable values for each partition is employed; the old 
variable values are assigned a negative weight. This multistep generalization 
of 8S0R is called Windowed-Block Successive Overrelaxation (WBSOR). Following 
is a formal presentation of WBSOR. Meanings of q, w, and window are as 
before. 


WBSOR - Windowed-Block Successive Overrelaxation 


Pardo i = 1 to q 

{ 

Pardo s=l to window 

{ 

y i (t s } = y i (t 0 ) 

} 

} 

for r=l to number of timesteps 


{ 


Beginning with v=0 increment v until lly V+ *(t ) - y V (t ) II < e 

{ r r 

for 1 = 1 to q do in parallel when possible 

{ 

window = tnin(w, number of timesteps - r + 1) 
for s = r to t + window - 1 

{ 

Solve: 


M 


, yT +1 (t ) = - 1 M y V+ ^(t ) + J P y V+ *(t ) 

1,1 1 S ^ 1,J S l,] J 8-1 


- Y M yT(t ) + l P yT(t ) + b (t ) 
1,J J s ^ l.J 8-1 1 S 


y^ +1 (t g ) = oiy^ +1 (t g ) + (l-aj)yY(t g ) 


Pardo i = 1 to q 


^i^r+windoi y i ^r+window-l ^ 

W " y I + 1 (t r ) 


} 


} 
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If the blocks or partitions are chosen properly, it is possible to order 
them so that one obtains two sets of blocks such that all blocks in a given 
set are uncoupled from one another. By convention the blocks in one set are 
designated as black blocks, and those in the other are designated as red 
blocks. For example, in the case of a two-dimensional spatial domain the grid 
of points on which the problem is to be solved may be partitioned into strips. 
The partitioning should be done in such a way that each strip is coupled at 
most to the two adjacent strips. The strips are assigned to two sets such 
that adjacent strips are in separate sets. 

In [SALT85] the spectral radius of the iteration matrix of WBSOR is shown 
to be independent of the window size used. It should be noted that WBSOR with 
window size one is simply BSOR. The use of WBSOR with any window size yields 
convergence that is asymptotically identical to that obtained with BSOR. 


4. The Implementation of WBJ and WBSOR on Multiprocessor Systems 

In this section we consider the implementation of WBJ and WBSOR on multi- 
processor machines. Here we assume that each processor of the system has a 
substantial local memory. Information in this local memory may be accessed 
more easily and inexpensively than information in the local memory of another 
processor or in any global memory that might exist. The submatrices required 
to carry out the computations of WBJ and WBSOR on particular portions of the 
domain of the partial differential equation being solved should be quickly 
obtainable from memory. Responsibility for work involving particular blocks 
or portions of the domain is consequently assigned to a specific processor. 
To reduce interprocessor communication requirements, blocks which share 
coupling variables should be assigned to the same processor when possible. 
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We first consider a straightforward multiprocessor implementation of 
WBJ. A collection of blocks is assigned to each processor. Each processor is 
programmed so that it may advance all blocks when the variable values required 
have been obtained from other processors. Upon computing new approximations 
to the partial differential equations at a given timestep, the processor sends 
values of the computed variables to other processors that require the 
information. The operations performed by the processors and the patterns of 
interprocessor communication when the blocks are of size one by one (i.e., 
points), are similar to those described in [REED84] . 

In WBJ, as presented above, synchronization between all processors must 
occur whenever the window shifts upwards. This requirement exists in order to 
ensure that convergence has occurred at the lowest timestep in the window 
before work on that time timestep comes to a halt. In order to advance to the 
nth iteration mth microstep where n > 1, each processor requires coupling 
variable values from the n-lst iteration, mth microstep. It is clear that all 
processors need not be performing work on the same iteration and the same 
raicrostep simultaneously. Thus, as long as the advancement of individual 
processes is based on the availability of appropriate coupling variable 
values, no global synchronization is necessary. Furthermore, as the window 
size of WBJ increases, the degree to which the processors are constrained by 
this implicit synchronization requirement decreases. The flow of data in a 
two processor system with a window of three is illustrated in Figure 4. 

We now consider a multiprocessor implementation of WBSOR. Henceforth, it 
will be assumed that the points or blocks involved in this algorithm are 
ordered using the red-black ordering. As explained before in WBSOR, as in 
WBJ, synchronization is performed among all processors whenever the window 
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shifts upwards. Blocks may be assigned to processors in one of a number of 
ways. The nature of the synchronization that must occur is dependent on the 
way in which blocks of differcit colors are assigned -0 processors. Consider 
first the case in which proc >ssors contain either culy red blocks or only 
black blocks. In order to advance to the nth iterat in, mth microstep where 
n > 1, each processor conta ning black bLocks req .ires coupling variable 
values from processors conta: ling red blocks from t .e n-lst iteration, mth 
microstep. Each processor c >ntaining red blocks requres coupling variable 
values from processors containing black blocks from the same iteration and the 
same microstep. 

With a window of size one, i.e., the standard red-black BSOR scheme, 
processors can only be active half of the time given an assignment of only one 
color block in each processor. This difficulty disappears when a window size 
of two or more is utilized, and with larger windows the tightness of the 
coupling between processors continues to decrease. The assignment of multiple 
red blocks to some processors and of multiple black blocks to others has a 
potentially serious side effect. Since black blocks are coupled only to red 
blocks and vice-versa, values of all of the coupling variables in every block 
must be obtained from other processors. In the po: it iterative version of 
WBSOR, this arrangement would require the communicati in of the value of every 
variable for each microstep computed at each iteration over every window. 

One may assign both red and black blocks to ear i processor and may thus 
be able to substantially reduce the interprocessor co munication requirements. 
Consider the case in which one wishes to solve a time dependent partial 
differential equation diffen need with a five point template. The domain of 
the partial differential equation is divided into reg ons and the variables in 
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each region are assigned to blocks. All blocks in each region are assigned to 
a particular processor. With this arrangement only the values of the 
variables corresponding to the mesh points at the boundaries of the regions 
need to be communicated regardless of the size of the blocks. 


5. Analysis of Communication Delay Effects 

This section demonstrates the usefulness of windowing techniques in 
ameliorating the effects of communication delays. The delay in sending a 
message from one processor to another may be written in the form ci + B * 
size, where a and ft are some parameters, and 'size' is equal to the number 
of bytes in the message. The nature of communication delays depends both on 
the multiprocessor architecture and the demands on the communication network 
made by the algorithm being run on the multiprocessor. In this paper, 
interprocessor communication delays will be modelled in the following two 
ways : 

(1) Uniform interprocessor communication delays that vary only with the size 
of each message, i.e., a and 8 are constants, and 

(2) detailed simulation of a specific family of multiprocessor architectures. 
The detailed simulations directly model any queuing effects that occur; 
hence, communication delay increases as a function of the number of 
messages sent. 

An upper bound of utilization for WBJ and WBSOR is derived below. The 
upper bound calculations assume uniform interprocessor delays as described 
above. It will be shown that this upper bound decreases with interprocessor 
communication delay but increases with window size. 
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First a simple lower bound on the time required for each PE to advance all 
of its blocks m cumulative microsteps will be derived. This bound is a linear 
function of the interprocessor communication lag and is inversely proportional 
to the window size. From this lower bound, an upper bound on processor 
utilization is obtained. It is assumed that communication lags are constant 
and unrelated to the amount of information sent. 

Fix attention on a given PE P, in a multiprocessor system. We consider 
the execution of WBJ and WBSOR with window size w, on a multiprocessor system. 
Assume that to advance all the blocks in a given processor P one cumulative 
microstep, a total time T would be required. Assume further that communica- 
tion with other processors requires time cT. The variable c will be called 
the communication delay. Consider a boundary block B m P which requires 
coupling variable data from some other processor or processors. 

Proposition 1: The boundary block B requires at least 2cT time to 
advance from the end of eras i _to_ eras l+kw, where k = 2 for WBJ and 
k= 1 for WBSOR . 

Proof : The boundary block B at ems l is , by definition, coupled to at least 
one block B' in another processor. Consider first the case of algorithm 
WBJ. In order for B' to advance to ems i+w it requires variable values from 
B at ems l. Time cT is required for this information to get to B'. After 
B' has computed its results for ems i+w it must send those results to B. 
This takes time cT. B cannot advance to ems i+2w until the results from B' 
corresponding to ems i+w arrive. Thus a lower bound on the time required for B 
to advance from ems i to ems i+2w is 2cT. A similar proof in the case of 
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WBSOR demonstrates that block B requires at least 2cT to advance from cms i to 
cms i+w . 

The above proposition provides a lower bound on the time T t t that 

processor P takes to complete its work. Since P must advance all of its 
blocks m cumulative microsteps, it cannot complete its work in less than time 
2cT L^J- Now mT is the computation time required to advance all blocks of 
P, m cumulative microsteps. The total time to complete the problem is hence 
bounded below as follows 

T > maxImT, 2cT 
tot \ 



The utilization U p of the processor P defined as 
accordingly subject to the following upper bound 


computation time 
T 

tot 


is 


/ 2c j 

1 ra j 


max 1 1 , — 
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A very simple asymptotic form will be derived which bounds the 
achievable, by a function dependent only on the window size 
communication delay c. Note that, 
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In the limit of large m, 
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and if c < j — then Up < 1. Thus, the communication delay that can be 

tolerated before a bound on utilization occurs increases linearly with w. 

If c > — , then U < ^ . This means that as the window size increases, the 
2 p 2c 

sensitivity of the processor to communication delays is reduced. Note that 
for WBJ, k = 2 and for WBSOR, k = 1 and hence for any window size, WBSOR is 
more sensitive to communication delays than WBJ. 

The upper bounds derived above do not depend on the amount of time 
required to advance any of the blocks in the system. A refinement of these 
upper bounds may be obtained by taking into account the time required to 
advance boundary blocks. Refined upper bounds will be computed for WBSOR 
below; the same principles could be utilized to refine the upper bounds for 
WBJ. 

Assume that all boundary blocks in all PEs require computation time T rom p 
for advancement. 

Proposition 2: The boundary block B requires at least 2cT + 2 T cq 

time to advance from the end of cms i _to^ cms l+w. 

The above proposition is proved exactly as was proposition (1) above, 
except that: (1) block B' must now compute for time T com p before sending 
variable values to B and (2) after receiving data from B' , B must compute 
for time T coni p before it can make variable values available at cms l+w. 

Reasoning exactly as before and substituting k = I, 


T _ > max 

mT, [2cT + 2T ) 

m 


tot 

\ comp J 

w 
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and hence the utilization bound is 


U 

P 



For large m, the asymptotic utilization bound is 


U 

P 


< 


max 1 1 , 



(4) 


6. Simulation Results 

Detailed simulations of a particular architecture were performed, and the 
algorithms developed here were implemented on this simulated machine to 
examine the effects of windowing on the system performance. Results obtained 
from these simulations for the algorithm WBSOR pertaining to the effect of 
communication delay and window size on processor utilization are outlined in 
this section. In all sets of the simulations we assume that blocks consist of 
strips of the domain (Figure 5) and that only the neighboring strips have 
coupled variables. All processors have the same number of blocks assigned to 
them, and the blocks assigned to a given processor are physically adjacent to 
one another. 

6. 1 A Simulated Shared Memory Machine 

A low level simulator called SIMON was employed to simulate a shared 
memory architecture. SIMON is an event-driven multi-processor simulator 
consisting of time-sorted queues [FUJI83] , [HELL84] . It is capable of 
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providing nano-second precision and allows the user to control the timing at 
the instruction level. Utility functions are provided to define multi- 
processor architecture, send and receive messages, etc. When a message is 
sent between processes, its arrival time is determined from the send time and 
a delay representing travel through the interprocessor connections and 
switches. It allows the user to control the computation costs at each 
processor as well as the costs involved m interprocessor communication. 

The multi-processor system simulated here consists of a number of 
processing elements and a global shared memory. Each processing element (PE) 
contains a central processing unit and a substantial local memory. The 
instructions and data corresponding to the tasks assigned to a processor 
reside in its local memory and the processor alone has direct access to this 
memory. The global shared memory is made up of a number of modules and these 
are accessible to all the processors with equal priorities. The processors 
are connected to the modules through a crossbar switch. The processors 
communicate with each other by reading and writing data m the shared memory 
modules. The access to these modules is brought about by means of 
input/output handlers (1/0-handler) and an input/output processor (i/o- 
processor). Attached to each PE is an 1/0-handler which takes care of the 
read/write operations associated with the shared memory and allows its host PE 
to continue performing computations. The shared memory and the crossbar of 
the system are controlled by the i/o-processor . Only one i/o-handler is 
allowed to read or write to a particular memory module at a given time, 
although a number of i/o-handlers can read or write to distinct memory 
modules. The i/o-processor arbitrates the access to the shared memory modules 
by different i/o-handlers through the crossbar switch. 
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The communication that occurs in this model machine involves either: 1) 
messages concerning requests and permissions to read or write, or 2) messages 
that include variable values that must be transmitted and received. The 
requests and permission messages are small, consisting of only a few bytes. 
The transmission of variable values requires messages that are generally much 
longer, and their size depends on the number of variables that are shared. 

The operation of the modeled shared memory machine is carried out as 
follows. When a processor needs to read or write data to the shared memory, 
it sends a message to its l/o-handler. This message designates whether a read 
or a write is requested and also designates the memory module required. When 
the l/o-handler receives this message, it forwards the request to the 1 / 0 - 
processor. The l/o-processor collects and queues requests to read and write 
sent by all of the l/o-handlers . When a request to read or write is serviced 
by the i/o-processor , an l/o-handler is given exclusive permission to interact 
with a specific memory module. Depending on the request involved, the l/o- 
handler can either write from the PEs' local memory to a particular memory 
module, or read from a particular memory module and write to its PEs local 
memory. 

6.2 Effect of Windowing 

The simulated shared memory system, described above, was employed to 
examine the performance of WBSOR as the window size was varied. The 
simulations were carried out for a system consisting of eight processors and 
eight memory modules and also for a system with eight processors and one 
memory module. Henceforth the former system is referred to as Machine A. and 
the latter as Machine B. 
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In the simulation runs described below, it is assumed that a model domain 
decomposition is applied where a uniform grid is divided into strips. Each 
strip has 1000 mesh points on the boundary. The time required to send each 
message, sent over a given link in the shared system, is assumed to be 1 
microsecond plus 0.025 times the number of bytes in the message. This 
corresponds to a bandwidth of 40 Mbytes per second per 32 bit wide 
communication channel. As mentioned earlier, the simulations explicitly take 
into account the communication requirements of the problem, including the 
queuing effects on the communication delays and the changes m the data 
requirements that occur when a window shifts upwards at the beginning of a new 
timestep. Thus, although the bandwidth of the channels is fixed, the problem 
parameters affect the communication delays m the system as a whole, and they 
are accounted for in these simulations. In these experiments it is assumed 
that the block advancement times for all blocks are identical in all the 
processors. A block advancement time is defined as the time it takes to 
perform computations for a single timestep during an iteration over the 
window, once the data for that timestep are available. Results are presented 
here for the block advancement times of 0.5 millisecond, l millisecond, and 5 
milliseconds. Finally, in these simulation experiments, it is assumed that an 
equal number of iterations are required over each timestep. 

The variation of processor utilization as a function of window size when 
one and two blocks are assigned to each processor of Machine A is depicted m 
Figure 6 and Figure 7, respectively. Figure 8 and Figure 9 show the same for 
Machine B. In the case of Machine A, the utilization increases as the window 
size is increased, when either one or two blocks are assigned to a PE. This 
is true for all the block advancement times considered. These improvements in 
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processor utilization with window size taper off for larger window sizes. As 
the block advancement time is increased, the relative effect of communication 
delays decreases, and the processor utilization improves. 

In Figure 6, where only one block is assigned to a PE, the utilization for 
window size one does not increase above 0.5, regardless of the time needed to 
advance a block. This is so because here each PE has either a black or a red 
block. Ignoring the predictions that occur at the beginning of each timestep, 
black blocks require variable values from red blocks from the last cumulative 
microstep, and red blocks require variable values from black blocks from the 
same cumulative microstep. Therefore, a black block and its neighboring red 
blocks cannot advance simultaneously when the window size is one; thus, the 
processor must remain idle approximately half of the time even if the inter- 
processor communication were instantaneous. This restriction disappears when 
the window size is greater than one or when more than one block is assigned to 
a PE. When each PE is assigned one block and when the block advancement time 
is relatively small, a window size greater than one helps to some extent, and 
the utilization goes up, but the queuing effects soon catch up with the gain 
from higher window size. From Figure 7 it can be seen that, if the number of 
blocks assigned to each PE is increased from one to two, the effect of queuing 
delays is decreased and much higher utilizations are observed. 

The effect of the underlying hardware, specifically that of the number of 
memory modules in the shared memory, on the performance of WBSOR is observed 
when the number of modules is reduced to one (Figure 8 and Figure 9). As 
before the utilizations increase with block advancement time, but much more 
gradually. The effect of change in the window size is not felt until the 
block advancement time is large enough to include all the queuing delays at 
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the module which acts as a bottleneck. The block advancement times, above 
this threshold, show trends similar to those observed in the case of Machine 
A. Altering the window size affects the patterns of interprocessor 
communication, but does not change the amount of information that must 
eventually be communicated before the problem is completed and hence, under 
some circumstances, with a smaller number of memory modules, one may expect 
lower performance improvements through the use of increasing window size.* 

The average communication delay between each pair of processors was 
measured from the simulations for machines A and B when two blocks were 
assigned to each PE. The maximum of the average interprocessor communication 
delays for Machines A and B is shown in Figure 10. The block advancement time 
is assumed to be 1 millisecond. Note that the interprocessor communication 
time increases quite gradually with window size, when eight modules are 
assigned, but increases almost linearly with the window size when the machine 
consists of only one module. An approximate value for the upper bound of 
processor utilization can be obtained by substituting in (4) the maximum 
average time needed for the interprocessor communication that must take place 
between pairs of processors. These upper bounds for utilization, given by 
(4), are compared with those observed in the simulation experiments for 
Machines A and B in Figure 11. The upper bound calculated using the maximum 
average interprocessor communication delay in (4) approximates rather closely 
the results obtained from the simulations. Thus, the usefulness of windows m 
mitigating the effects of communication delays is demonstrated m a realistic 
simulated machine. 


A detailed analysis of the influence of hardware parameters on the algorithm 
performance will be published separately. 
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7. Experimental Results on Convergence and Computational Overhead 

Even though the spectral radii of the iteration matrices do not vary with 
window size, the computation time required to complete a problem may increase 
with window size for the following two reasons: 1) The number of iterations 
required to bring two successive approximations to within tolerance in a 
specific norm, in this case the maximum norm, may have a dependence on window 
size; 2) the computational work required per iteration is expected to increase 
with window size to a minor degree. Here the experimental results on the 
effect of window size on the number of iterations required and on the total 
computation time taken are presented. It will be seen that the cost increase 
is quite modest and is often outweighed by the increase in the processor 
utilization attributable to the application of windows. 

The results on overhead attributable to the use of windows were found to 
be similar for both WBJ and WBSOR, and hence the results pertaining to WBSOR 
are presented. The heat equation was solved using a 50 by 50 point mesh and a 
timestep of 0.001. The initial condition consisted of the first two modes of 
the equation. The equation was solved subject to Dinchlet boundary 
conditions. Iterations were continued until the maximum of the difference 
between two succeeding approximations at the first microstep in the current 
window was less than the given tolerance of IE-5. 

The domain was decomposed into blocks of different sizes in different 
experiments. The domain was divided into 5 strips that were each 50 by 10 
points, 10 strips that were each 50 by 5 points, and 25 strips that were each 
50 by 2 points. Square blocks that were each 10 points by 10 points were also 
considered. The equation was advanced 50 timesteps and the average number of 
iterations required to achieve the prescribed tolerance of IE-5 was calculated 
for each of the types of blocks. 
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For window sizes 1, 2, 3, and 4 the average number of iterations required 
to reach tolerance IE-5 is displayed in Figure 12 and Figure 13, when 50 x 2 
and 10 x 10 blocks are used respectively. It is clear that the average number 
of iterations required increases rather gently with window size. The number 
of iterations required by the first few timesteps of the problems investigated 
here grows relatively quickly with window size (Figure 12 and Figure 13). 
This effect is due to the cost involved m getting the multistep algorithm 
started . 

Once a multistep algorithm is underway, the quality of the approximation 
at a timestep is successively improved as the window creeps upward. The first 
iterations over a timestep that occur when the timestep is at the top of a 
window may be thought of as establishing a rough approximation. As the window 
moves upward, the relative position of the timestep in the window goes down 
and the approximation to the solution at that timestep is refined. In the 
first iteration over the first window, the initial condition of the parabolic 
equation is used as the initial value at the beginning of each timestep m the 
window. The gradient of rough to fine approximation as a function of the 
position of the timestep in the window develops as the solution is carried 
out . 

Figure 14 depicts the computational overhead involved in using windows of 
size greater than one in the solution of the above described heat equation. 
The solution time increases with the size of the window. Here the overhead 
was computed by timing the computer runs for a given block domain decomposi- 
tion when windows of size one through four were examined. The ratio of the 
time required to advance the problem 50 timesteps with window 1, to the time 
required to complete the problem with window greater than one, is plotted for 
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each domain decomposition. For windows of size 2 the overhead observed ranged 
from 0.02 to 0.07, for windows of size 3 the overhead ranged from 0.07 to 
0.13, and for windows of size 4 the overhead ranged from 0.09 to 0.21. The 
smallest overheads for each window size are seen when the domain is divided 
into 50 x 10 point blocks. 

It is clear that while the use of windows does not affect asymptotic 
convergence, there is some computational overhead involved in their use that 
increases with window size. The experimental results described here show that 
the overhead is quite modest for small windows. 


8. Conclusions 

This paper explores methods for efficient solution of partial differential 
equations on MIMD machines. The general objective of this work is to maximize 
multiprocessor performance by rearranging the order of computations of 
standard algorithms so that the effects of communication delays are 
ameliorated, but at the same time the resulting algorithms have favorable, 
well defined convergence properties. 

Using Jacobi and S0R point and block iterative methods as a basis, a new 
concept of windowing over several time-steps is developed. Both analytical 
and simulation results demonstrate the usefulness of windowing in decreasing 
the effects of communication delays on algorithm performance. The spectral 
radii of the iteration matrices of both of these new algorithms are equivalent 
to the spectral radii of the analogous standard methods [SALT85], The use of 
windows entails a small computational overhead which increases gradually with 
window size. It was observed that the computational overhead associated with 
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a window size of two was negligible and the benefits in increased utilization 
were substantial. 

Further investigations are being pursued in a number of ways. The concept 
of windowing can be extended to other iterative methods. The generalization 
of WBSOR to multicolor SOR, [ADAM82] or the orderings introduced by O'Leary 
I.OLEA84] would appear to be particularly straightforward. The concept of 
windowing may also be extended to apply to iterative methods in the solution 
of the equations arising from Newton-like schemes for the solution of systems 
of nonlinear algebraic equations. It also may be possible to extend the 
windowing concept to the solution by functional iteration of nonlinear 
equations that might be obtained in a method of lines solution to nonlinear 
parabolic equations. 
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Iteration until convergence over consecutive timesteps 
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Counting of cumulative microsteps 
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Heat equation — Dirichlet boundary conditions. 
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Heat equation — Dirichlet boundary conditions. 
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